Abstract

The tree-metric theorem provides a necessary and sufficient condition for a
dissimilarity matrix to be a tree metric, and has served as the foundation for
numerous distance-based reconstruction methods in phylogenetics. Our main result
is an extension of the tree-metric theorem to more general dissimilarity maps.
In particular, we show that a tree with $n$ leaves is reconstructible from the
weights of the $m$-leaf subtrees provided that $n \ge 2m - 1$.

1 Introduction

The problem of reconstructing a graph from measurements of distances between
certain nodes was first proposed by Hakimi and Yau [6] in 1965, and has since
developed into a myriad of related questions arising from diverse applications
such as phylogeny reconstruction and internet tomography. These problems all
have in common the notion of a graph realization of a matrix. That is, if $D$
is a matrix whose rows and columns are indexed by a set $X$, then $D$ has a
realization if there is a weighted graph $G$ whose node set contains $X$, and
that satisfies $d(u,v) = D(u,v)$, where $d(u,v)$ is the distance between the
two nodes $u$ and $v$ in the graph. In what follows we will assume that our
matrices $D$ are symmetric and have zeros on the diagonal; in phylogenetics
these are called dissimilarity matrices. A non-negative dissimilarity matrix
that has a graph realization is called a distance matrix. The Hakimi-Yau
problem is therefore:



Problem 1 (Hakimi-Yau). Given a distance matrix $D$ on a set $X$, find a graph
$G$ which is a realization of $D$ so that the sum of the edge lengths of $G$ is
minimized.

The general problem is notoriously difficult [S]; however, the case when $G$ is
restricted to be a tree (and $X$ corresponds to the leaves of the tree) is well
understood. Distance matrices realized by trees are said to be tree metrics, and
the main result is the classic tree-metric theorem piBl [Tnill3| : 

Theorem (Tree-Metric Theorem). Let $D$ be a non-negative dissimilarity matrix
on $X$. Then $D$ is a tree metric on $X$ if and only if, for every four (not
necessarily distinct) elements $i, j, k, l \in X$, two of the three terms in the
following expression are equal and greater than or equal to the third:

$$D(i,j) + D(k,l), \quad D(i,k) + D(j,l), \quad D(i,l) + D(j,k). \tag{1}$$

Furthermore, the tree $T$ with leaves $X$ that realizes $D$ is unique.
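
As an illustration, the condition of the theorem can be tested directly on a
small matrix. The following Python sketch is not part of the original
development: the function name `is_tree_metric`, the tolerance `tol` and the
two example matrices are assumptions made for this example only.

```python
from itertools import combinations_with_replacement

def is_tree_metric(D, tol=1e-9):
    """Check the four-point condition for a symmetric matrix D (list of lists).

    D is assumed to be a non-negative dissimilarity matrix (zero diagonal).
    """
    n = len(D)
    # The three sums are unchanged when i, j, k, l are permuted, so it is
    # enough to enumerate multisets of four (not necessarily distinct) indices.
    for i, j, k, l in combinations_with_replacement(range(n), 4):
        sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        # The two largest sums must coincide (up to the tolerance).
        if sums[2] - sums[1] > tol:
            return False
    return True

# Hypothetical example 1: the quartet tree with leaves 0, 1 on one side and
# 2, 3 on the other, pendant edges of weight 1 and an internal edge of weight 2.
D_tree = [[0, 2, 4, 4],
          [2, 0, 4, 4],
          [4, 4, 0, 2],
          [4, 4, 2, 0]]

# Hypothetical example 2: shortest-path distances on a 4-cycle with unit edges.
D_cycle = [[0, 1, 2, 1],
           [1, 0, 1, 2],
           [2, 1, 0, 1],
           [1, 2, 1, 0]]

tree_ok = is_tree_metric(D_tree)
cycle_ok = is_tree_metric(D_cycle)
```

Here `tree_ok` is true and `cycle_ok` is false: for the 4-cycle the three sums
are 2, 2 and 4, so the maximum is attained only once.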

The condition in this theorem is known as the four-point condition; it is
therefore a necessary and sufficient condition for a matrix to be realized by a
tree.

The algorithmic question of how to construct a tree from a tree metric is 
particularly important for the reconstruction of phylogenetic trees from genomic 
sequences, because the tree-metric theorem suggests the approach of estimat- 
ing pairwise distances between the sequences and then reconstructing the tree. 
Methods that reconstruct the correct unique tree given a tree metric are called
consistent, and biologists are particularly interested in consistent methods that 
are able to reconstruct the correct tree topology even with inaccurate distance 
matrices. Numerous polynomial time consistent algorithms have been proposed, 
of which we mention the classic Buneman method jS] and the neighbor-joining 
method [8]. The neighbor-joining algorithm (NJ) is very popular because it
is fast and performs well in practice; we mention the Buneman construction 
because the proof of our main result is based on it. 

In this paper we propose the following generalization of the Hakimi-Yau
problem. An m-dissimilarity map is a map $D : X^m \to \mathbb{R}$ (where $X^m$
denotes the $m$th Cartesian product of $X$) with
$D(x_1, \ldots, x_m) = D(x_{\pi(1)}, \ldots, x_{\pi(m)})$ for all permutations
$\pi \in S_m$, and $D(x, x, \ldots, x) = 0$. We say that a graph $G$ realizes
$D$ if the node set of $G$ contains $X$ and, for every $x_1, \ldots, x_m \in X$,
the weight of the smallest connected subgraph of $G$ containing
$x_1, \ldots, x_m$ is $D(x_1, \ldots, x_m)$. We call an m-dissimilarity map that
is realizable an m-distance map.

Problem 2. Given an m-distance map $D$ on a set $X$, find a graph $G$ which is
a realization of $D$ so that the sum of the edge lengths of $G$ is minimized.

Our main result is a weak version of the tree-metric theorem for m-dissimilarity 
maps: we show that an m-dissimilarity map does determine the tree it comes 
from (as long as m is small compared to the number of leaves) although we give 
no explicit criterion for an m-dissimilarity map to come from a tree. We also 
discuss the ramifications of our result for phylogenetic tree reconstruction, in 
particular the relationship between our reconstruction theorem and maximum- 
likelihood methods for phylogenetic trees. 

2 Reconstructing Trees 

Let $T$ be a (finite) tree, with a positive weight $w(e)$ assigned to each edge
$e$. Let $L(T)$ denote the set of leaves of $T$. For any set $V \subseteq L(T)$
of leaves, let $[V]$ denote the smallest subtree of $T$ containing $V$. For $T'$
any subtree of $T$, let $w(T')$ be the sum of the weights of the edges of $T'$.
The main result of this paper is

Theorem. Let $T$ be a tree with $n$ leaves and no vertices of degree 2. Let
$m \ge 3$ be an integer. If $n \ge 2m - 1$, then $T$ is determined by the set of
values $w([V])$ as $V$ ranges over all $m$-element subsets of $L(T)$. If
$n = 2m - 2$, this is not true.

In other words, if $m$ is fixed and the number of leaves in a tree $T$ is
sufficiently large, then $T$ is uniquely determined from the m-dissimilarity map
corresponding to the weights of the subtrees on $m$ leaves.
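
To make the m-dissimilarity map of a tree concrete, the following Python
sketch computes the subtree weights $w([V])$ by a depth-first search. It is an
illustration only: the names `subtree_weight` and `m_dissimilarity_map`, the
edge-dictionary representation and the five-leaf example tree are assumptions
of this sketch and do not come from the text.

```python
from itertools import combinations

def subtree_weight(edges, V):
    """Return w([V]), the weight of the smallest subtree spanning the leaves V.

    edges maps pairs (u, v) of adjacent vertices of a tree to positive weights.
    """
    adj = {}
    for (u, v), wt in edges.items():
        adj.setdefault(u, []).append((v, wt))
        adj.setdefault(v, []).append((u, wt))
    V = set(V)
    total = 0

    def visit(node, parent):
        # Returns True when the part of the tree reached through `node`
        # (moving away from the root) contains a member of V.
        nonlocal total
        found = node in V
        for nbr, wt in adj[node]:
            if nbr != parent and visit(nbr, node):
                # The edge (node, nbr) separates the root from a member of V,
                # so it belongs to [V].
                total += wt
                found = True
        return found

    visit(next(iter(V)), None)  # root the search at a member of V
    return total

def m_dissimilarity_map(edges, leaves, m):
    """The map V -> w([V]) over all m-element subsets V of the leaves."""
    return {frozenset(V): subtree_weight(edges, V) for V in combinations(leaves, m)}

# Hypothetical example: leaves a, b, c, d, e and internal vertices x, y, z.
example_edges = {
    ("a", "x"): 1, ("b", "x"): 2, ("x", "y"): 3, ("c", "y"): 4,
    ("y", "z"): 5, ("d", "z"): 6, ("e", "z"): 7,
}
weights = m_dissimilarity_map(example_edges, "abcde", 3)
```

In this example $n = 5$ and $m = 3$, so $n \ge 2m - 1$ holds and the ten values
stored in `weights` are exactly the data from which the theorem recovers the
tree.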

We will engage in the following abuse of notation: when $S \subseteq L(T)$ and
$v_1, \ldots, v_r \in L(T)$, we will write $v_1 \ldots v_r$ to denote
$\{v_1, \ldots, v_r\}$ and we will write $S v_1 \ldots v_r$ to mean
$S \cup \{v_1, \ldots, v_r\}$. We will also write $w(V)$ to mean $w([V])$.

Let $T$ be a tree and let $i$, $j$, $k$ and $l$ be distinct members of $L(T)$.
We will say $i$ and $j$ are opposite $k$ and $l$ if $[ijkl]$ is of the following
form: there are two (distinct) vertices $v$ and $w$ of $T$ such that $[ijkl]$
consists of five edge-disjoint paths: a path from $i$ to $v$, a path from $j$ to
$v$, a path from $v$ to $w$, a path from $w$ to $k$ and a path from $w$ to $l$.
We will denote this state of affairs by $(i,j;k,l)$ and we will denote the path
from $v$ to $w$ by $\gamma_{ijkl}$. It is clear that $\gamma_{ijkl}$ is well
defined; in phylogenetics it is known as the Buneman index of the split
$(i,j;k,l)$. Finally, if $S$ is a set, $\binom{S}{m}$ denotes the set of
$m$-element subsets of $S$.

In order to prove the main theorem, we first show that if $T$ is a tree with
no vertices of degree 2 and having $n$ leaves ($n \ge 2m - 1$, $m \ge 3$) then
we can reconstruct the topology of $T$. We will rely on the well-known result
that $T$ can be recovered as an abstract graph (without determining the edge
weights) by knowing the signs of the Buneman indices, i.e. for which $i$, $j$,
$k$ and $l$ we have $(i,j;k,l)$. The following lemma shows that this data can be
recovered from the values $w([V])$ as $V$ ranges over $\binom{L(T)}{m}$.

Lemma. Let $i$, $j$, $k$ and $l$ be distinct members of $L(T)$. Then
$(i,j;k,l)$ iff there is an $R \in \binom{L(T) \setminus \{i,j,k,l\}}{m-2}$ such
that

$$w(Rij) + w(Rkl) < w(Rik) + w(Rjl) = w(Ril) + w(Rjk). \tag{2}$$

Proof. Let $R \in \binom{L(T) \setminus \{i,j,k,l\}}{m-2}$. Let $\bar{T}$ be the
tree that arises by contracting $[R]$ to a point. We will denote the images of
$i$, $j$, etc. in $\bar{T}$ by $\bar{i}$, $\bar{j}$, etc. We note that $\bar{R}$
is a leaf of $\bar{T}$. Now,

$$w(Rij) = w(\bar{R}\bar{i}\bar{j}) + w(R),$$

so (2) holds iff the corresponding statement holds in $\bar{T}$.

If $(\bar{i},\bar{j};\bar{k},\bar{l})$ then clearly $(i,j;k,l)$. We claim that,
conversely, if $(i,j;k,l)$ then there exists some $R$ such that
$(\bar{i},\bar{j};\bar{k},\bar{l})$. To see this, let $e$ be an edge in
$\gamma_{ijkl}$. Removing $e$ from $T$ divides $T$ into two trees; let $L_1$ and
$L_2$ be the leaves of these trees. Now,
$|L(T) \setminus \{i,j,k,l\}| = n - 4 \ge 2m - 5$, so either
$L_1 \setminus \{i,j,k,l\}$ or $L_2 \setminus \{i,j,k,l\}$ has at least $m - 2$
members; without loss of generality suppose the former does. Then take
$R \in \binom{L_1 \setminus \{i,j,k,l\}}{m-2}$; we will have $e \notin [R]$, so
$(\bar{i},\bar{j};\bar{k},\bar{l})$.

The previous two paragraphs reduce us to the case where $|R| = 1$, that is,
where $m = 3$ and $R = \{r\}$ for some $r \in L(T)$. We may delete all leaves
other than $i$, $j$, $k$, $l$ and $r$. Now the claim is reduced to the case
where $m = 3$ and $n = 5$, and can be verified by checking it for all trees with
five leaves. □
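
The Lemma can be tried out on a small case. The following Python sketch
searches for the sets $R$ of condition (2) on a hypothetical five-leaf tree with
$m = 3$. The encoding of the tree by its splits, the names `SPLITS`, `w` and
`detected_splits`, and the tolerance `tol` are assumptions made for this
illustration; it is a brute-force check, not an efficient reconstruction
algorithm.

```python
from itertools import combinations

# Hypothetical five-leaf tree with cherries {a, b} and {d, e} and a middle
# leaf c. Each edge is stored as the set of leaves on one side of it,
# together with its weight.
SPLITS = {
    frozenset("a"): 1, frozenset("b"): 2, frozenset("c"): 4,
    frozenset("d"): 6, frozenset("e"): 7,
    frozenset("ab"): 3,  # internal edge separating {a, b} from {c, d, e}
    frozenset("de"): 5,  # internal edge separating {d, e} from {a, b, c}
}

def w(V):
    """Subtree weight w([V]): an edge lies in [V] iff V meets both of its sides."""
    V = frozenset(V)
    return sum(wt for side, wt in SPLITS.items() if 0 < len(V & side) < len(V))

def detected_splits(leaves, m, tol=1e-9):
    """All quartet splits {{i, j}, {k, l}} for which some R satisfies (2)."""
    found = set()
    for i, j, k, l in combinations(leaves, 4):
        rest = [x for x in leaves if x not in (i, j, k, l)]
        # Try each of the three ways of pairing up the four leaves.
        for (a, b), (c, d) in (((i, j), (k, l)), ((i, k), (j, l)), ((i, l), (j, k))):
            for R in combinations(rest, m - 2):
                R = frozenset(R)
                lhs = w(R | {a, b}) + w(R | {c, d})
                mid = w(R | {a, c}) + w(R | {b, d})
                rhs = w(R | {a, d}) + w(R | {b, c})
                if lhs < mid - tol and abs(mid - rhs) <= tol:
                    found.add(frozenset({frozenset({a, b}), frozenset({c, d})}))
                    break
    return found

splits_found = detected_splits("abcde", 3)
```

For this tree the search returns the five splits ab|cd, ab|ce, ab|de, ac|de and
bc|de, one for each choice of four leaves, which is the quartet data needed to
recover the topology.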
Remark: One can check that, if the subtree weights are modified by less
than $\epsilon/2$, where $\epsilon$ is the weight of the smallest edge, the
reconstructed tree will not be altered. This bound of $\epsilon/2$ is the same
as for the classical case of $m = 2$.

We still need to show that we can recover the edge weights of $T$. Continue
to assume $T$ has no vertices of degree 2, $|L(T)| = n$ and $n \ge 2m - 1$. By
the Lemma, the values $w(V)$, $V \in \binom{L(T)}{m}$, determine the graph of
$T$, so we may assume that this is known.

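Let $e$ be an edge of $T$. We first consider the case in which neither of the
endpoints of $e$ is a leaf. Then we can find $i$, $j$, $k$ and $l$ in $L(T)$
such that $(i,j;k,l)$ and $\gamma_{ijkl} = e$ (here and at many points in the
future, we use that we already know the graph of $T$). Moreover, using
$n - 4 \ge 2m - 5$ as before, we can find
$R \in \binom{L(T) \setminus \{i,j,k,l\}}{m-2}$ such that $e \notin [R]$.
Contract $[R]$ to a point as in the proof of the Lemma. Each of the four
subtrees $[Rik]$, $[Rjl]$, $[Rij]$ and $[Rkl]$ then becomes a subtree spanned by
three points of the contracted tree, whose weight is half the sum of the three
pairwise distances. The distances involving the contracted point cancel, leaving
half of the classical four-point quantity
$d(i,k) + d(j,l) - d(i,j) - d(k,l) = 2w(e)$. Then

$$w(Rik) + w(Rjl) - w(Rij) - w(Rkl) = w(e),$$

so we can determine $w(e)$.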
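
The combination of subtree weights used in this step can be evaluated on a
small example. The sketch below uses $m = 3$ and a hypothetical five-leaf tree
encoded by its splits; the names `SPLITS`, `w` and `edge_combination` are
assumptions of this illustration.

```python
# Hypothetical example tree with leaves a, b, c, d, e. Each edge is stored as
# the set of leaves on one side of it, together with its weight.
SPLITS = {
    frozenset("a"): 1, frozenset("b"): 2, frozenset("c"): 4,
    frozenset("d"): 6, frozenset("e"): 7,
    frozenset("ab"): 3,  # internal edge separating {a, b} from {c, d, e}
    frozenset("de"): 5,  # internal edge separating {d, e} from {a, b, c}
}

def w(*leaves):
    """Subtree weight w([V]): an edge lies in [V] iff V meets both of its sides."""
    V = set(leaves)
    return sum(wt for side, wt in SPLITS.items() if 0 < len(V & side) < len(V))

def edge_combination(R, i, j, k, l):
    """w(Rik) + w(Rjl) - w(Rij) - w(Rkl) for a tuple R of m - 2 leaves."""
    return w(*R, i, k) + w(*R, j, l) - w(*R, i, j) - w(*R, k, l)

# The path between the branch points of (a, b; c, d) is the edge {a, b} | {c, d, e}.
value_ab = edge_combination(("e",), "a", "b", "c", "d")
# The path between the branch points of (a, c; d, e) is the edge {d, e} | {a, b, c}.
value_de = edge_combination(("b",), "a", "c", "d", "e")
```

In this example `value_ab` is 3 and `value_de` is 5, the weights assigned to
the two internal edges, so for $m = 3$ the combination returns the weight of
the internal edge itself.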

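Let $v \in L(T)$, and let $e_v$ be the unique edge incident to $v$. Let us
assume that we have already computed $w(e)$ for all $e \in T$ not of the form
$e_v$. Then, for all $V \in \binom{L(T)}{m}$, we can compute

$$\sum_{v \in V} w(e_v) = w(V) - \sum_{e \in [V],\ e \text{ not of the form } e_v} w(e).$$

Now, let $i$ and $j$ be distinct members of $L(T)$. For any
$S \in \binom{L(T) \setminus \{i,j\}}{m-1}$ we have

$$w(e_i) - w(e_j) = \sum_{v \in Si} w(e_v) - \sum_{v \in Sj} w(e_v).$$

Thus, we can determine the $w(e_v)$ up to a common additive constant. We may
determine that constant from $w(V)$ for any $V \in \binom{L(T)}{m}$.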

It remains to show that if $n = 2m - 2$ it may not be possible to reconstruct
the tree. Let $T$ be the following tree: the vertices of $T$ are known as
$v_1, \ldots, v_n$ and $w_2, \ldots, w_{n-1}$. The edges of $T$ are of the
following two forms: $(v_{i-1}, v_i)$ and $(v_i, w_i)$. We will set $w_1 = v_1$
and $w_n = v_n$. Clearly,
$L(T) = \{w_1 = v_1, w_2, \ldots, w_{n-1}, w_n = v_n\}$. The edge weights may be
chosen arbitrarily.

Let $T'$ be the same tree except that the edges $(v_{m-1}, w_{m-1})$ and
$(v_m, w_m)$ are deleted and replaced by $(v_m, w_{m-1})$ and $(v_{m-1}, w_m)$.
We place the edges of $T$ and those of $T'$ in bijection by making
$v_{m-1} w_{m-1}$ correspond to $v_m w_{m-1}$, $v_m w_m$ correspond to
$v_{m-1} w_m$ and pairing all other edges in the obvious way. Assign the weights
to the edges of $T'$ that are assigned to the corresponding edges of $T$.

We claim that, for any $V \in \binom{L(T)}{m} = \binom{L(T')}{m}$, the same
edges (using the above bijection) appear in $[V]$ and $[V]'$, where $[V]'$
denotes the minimal subtree of $T'$ containing $V$. To prove this, we divide the
edges $e$ of $T$ into three types.

First, we could have $e = (v_i, w_i)$. Then $e \in [V]$ iff $w_i \in V$ and the
same is true for $[V]'$. Secondly, we could have $e = (v_{i-1}, v_i)$ with
$i \ne m$. Then $e \in [V]$ iff $w_j$ and $w_k \in V$ for some
$j \le i - 1 < i \le k$. This also holds for $[V]'$.

So far, we have not used the fact that $|V| = m$; the previous paragraph is
correct for any $V \subseteq L(T)$. We now consider the final case,
$e = (v_{m-1}, v_m)$. Removing $e$ from $T$ divides $L(T)$ into two sets $L_1$
and $L_2$, each of size $m - 1$. As $|V| = m$, $V \cap L_1$ and $V \cap L_2$ are
both nonempty, so $e \in [V]$. Similarly, $e \in [V]'$. □
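
The construction of $T$ and $T'$ can be checked mechanically for small $m$. In
the Python sketch below, a caterpillar tree is encoded by its splits; the
function names, the particular edge weights and this encoding are assumptions
made for illustration and are not taken from the text.

```python
from itertools import combinations

def caterpillar(order, spine, pendant):
    """Edges of a caterpillar tree as (leaves on one side, weight) pairs.

    order   : the leaves in the order in which they hang off the path
    spine   : spine[i - 1] is the weight of the path edge (v_i, v_{i+1}),
              which separates the first i leaves from the remaining ones
    pendant : pendant[x] is the weight of the edge joining the interior
              leaf x to the path
    """
    n = len(order)
    edges = [(frozenset(order[:i]), spine[i - 1]) for i in range(1, n)]
    edges += [(frozenset([x]), pendant[x]) for x in order[1:-1]]
    return edges

def weight(edges, V):
    """Subtree weight: an edge lies in [V] iff V meets both of its sides."""
    V = frozenset(V)
    return sum(wt for side, wt in edges if 0 < len(V & side) < len(V))

def counterexample(m):
    """The trees T and T' on n = 2m - 2 leaves, with corresponding weights."""
    n = 2 * m - 2
    leaves = list(range(1, n + 1))
    spine = list(range(1, n))                    # arbitrary positive weights
    pendant = {x: 10 + x for x in leaves[1:-1]}  # arbitrary positive weights
    T = caterpillar(leaves, spine, pendant)
    swapped = leaves[:]
    # Exchange the attachment points of w_{m-1} and w_m.
    swapped[m - 2], swapped[m - 1] = swapped[m - 1], swapped[m - 2]
    T_prime = caterpillar(swapped, spine, pendant)
    return leaves, T, T_prime

def agree(m, size):
    """Do T and T' assign the same weight to every leaf subset of this size?"""
    leaves, T, T_prime = counterexample(m)
    return all(weight(T, V) == weight(T_prime, V)
               for V in combinations(leaves, size))
```

For example, `agree(3, 3)` holds while `agree(3, 2)` fails: the two trees have
the same 3-leaf subtree weights but different pairwise distances, so they are
genuinely different weighted trees.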

Our main result shows that, when $n$ is large enough compared to $m$, we can
reconstruct a tree from the weights of its $m$-leaf subtrees. However, if we are
simply given an m-dissimilarity map $D : \binom{[n]}{m} \to \mathbb{R}$, we do
not know how to test whether this map comes from a tree.

When $m = 2$, this is given by the tree-metric theorem. When $m$ is larger
than 2, an obviously necessary condition is that, for every
$R \in \binom{[n]}{m-2}$ and $i$, $j$, $k$ and $l \in [n] \setminus R$, two out
of three of the following expressions must be equal to each other and greater
than or equal to the third:

$$D(Rij) + D(Rkl), \quad D(Rik) + D(Rjl), \quad D(Ril) + D(Rjk).$$

Moreover, we can impose the combinatorial requirement that, when we consider
the above expressions with the same $(i,j,k,l)$ and different $R$ and $R'$, the
same one of the three terms is minimized. However, by counting dimensions,
we can see that this condition is not adequate in any case except $n = 5$,
$m = 3$.

3 The Tropical Analogy 

In this section, we describe a connection between subtree weights and an area 
of algebraic geometry known as "tropical geometry", and inquire whether the
analogy can be made tighter. The basic reference for our discussions is 
although we reverse the sign conventions of that paper to more closely match 
those occurring elsewhere in this one. 

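Let $f = \sum_{e \in E} f_{e_1 \cdots e_n} x_1^{e_1} \cdots x_n^{e_n}$ be a
polynomial in $n$ variables, where the $f_{e_1 \cdots e_n}$ are nonzero. Define
$\operatorname{Trop} f$ to be the set of $w = (w_1, \ldots, w_n) \in
\mathbb{R}^n$ such that, of the collection of numbers $\sum_{i=1}^n e_i w_i$,
where $e$ runs over $E$, the maximum occurs twice. If
$I \subseteq K[x_1, \ldots, x_n]$ is an ideal, set
$\operatorname{Trop} I = \bigcap_{f \in I} \operatorname{Trop} f$.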
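
Membership in $\operatorname{Trop} f$ only depends on the exponent vectors of
$f$, so it is easy to test. The following Python sketch is an illustration
under our own naming (`in_trop`, `tol`); "occurs twice" is read as "is attained
by at least two exponent vectors".

```python
def in_trop(exponents, w, tol=1e-9):
    """Is w in Trop f, where f has the given exponent vectors?

    exponents : list of exponent vectors e = (e_1, ..., e_n), one per monomial
                with nonzero coefficient
    w         : the point (w_1, ..., w_n)
    """
    values = sorted(sum(e_i * w_i for e_i, w_i in zip(e, w)) for e in exponents)
    # The maximum must be attained by at least two monomials.
    return len(values) >= 2 and values[-1] - values[-2] <= tol

# Hypothetical example 1: f = x_1 + x_2 + 1.
line = [(1, 0), (0, 1), (0, 0)]
on_line = in_trop(line, (1, 1))    # values 1, 1, 0
off_line = in_trop(line, (2, 1))   # values 2, 1, 0

# Hypothetical example 2: f = p_12 p_34 - p_13 p_24 + p_14 p_23, with the
# variables ordered p_12, p_13, p_14, p_23, p_24, p_34, evaluated at the
# pairwise distances of a quartet tree (pendant edges 1, internal edge 2).
pluecker = [(1, 0, 0, 0, 0, 1), (0, 1, 0, 0, 1, 0), (0, 0, 1, 1, 0, 0)]
tree_metric = (2, 4, 4, 4, 4, 2)
metric_in_trop = in_trop(pluecker, tree_metric)   # values 4, 8, 8
```

The second example is the case $m = 2$ of the three-term condition: the test
succeeds because the tree metric satisfies the four-point condition.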

Now, consider the case of a polynomial ring whose variables are indexed by
the $m$-element subsets of $[n]$; we will write these variables as $p_S$ for
$S \in \binom{[n]}{m}$.

Let $w \in \mathbb{R}^{\binom{[n]}{m}}$. The statement that the maximum of

$$w_{Rij} + w_{Rkl}, \quad w_{Rik} + w_{Rjl}, \quad w_{Ril} + w_{Rjk}$$

occurs twice precisely says that

$$w \in \operatorname{Trop}(p_{Rij} p_{Rkl} - p_{Rik} p_{Rjl} + p_{Ril} p_{Rjk}).$$

So, if $w$ arises from the $m$-leaf subtree weights of a tree, then
$w \in \operatorname{Trop} f$, where $f$ is any polynomial of the form
$p_{Rij} p_{Rkl} - p_{Rik} p_{Rjl} + p_{Ril} p_{Rjk}$, $i < j < k < l$. Such
polynomials are called the three-term Plücker relations. It is well known that
all of these relations lie in the ideal of the Grassmannian $G(m, n)$.

In the case where $m = 2$, it has been shown that $\operatorname{Trop} G(2, n)$
is exactly the space of tree metrics. This raises several natural problems:

Problem 3. Does the space of $m$-leaf subtree weights lie in $\operatorname{Trop} G(m, n)$?

(It can be shown that the two spaces are not equal: $\operatorname{Trop} G(m, n)$
has larger dimension.) Assuming the answer to the above problem is "yes", and
inspired by the bijection between tree metrics and points of
$\operatorname{Trop} G(2, n)$, we can ask for more.

Problem 4. Is there a map $G(2, n) \to G(m, n)$ with image $X$ such that
$\operatorname{Trop} X$ is the space of $m$-leaf subtree weights?

A positive answer to the above problem is known in the case where $m = 3$.
Write a point of $G(m, n)$ as a matrix with $n$ rows and $m$ columns, considered
up to the right action of $GL_m$. One can check that the following map
$\mathrm{Mat}_{2 \times n} \to \mathrm{Mat}_{3 \times n}$ descends to a map
$G(2, n) \to G(3, n)$:



This map takes a tree metric to twice the corresponding 3-leaf subtree weight.
We have not found such a "geometric lifting" for $m > 3$.

4 Applications 

The fundamental problem in phylogenetics is to reconstruct trees from sets of 
sequences related by an evolutionary tree. The sequences can be DNA or protein 
sequences, or more generally can encode the order of genes in a genome or other 
evolving features. Some of the most popular methods for reconstructing trees are 
distance based methods. Distance based methods begin by estimating pairwise 
distances between the sequences, thus leading to a dissimilarity map (although 
it need not be an actual tree-metric). A tree is then reconstructed from the 
dissimilarity map, and it is hoped that the topology of the tree is correct. 

Distance-based methods are typically based on maximum-likelihood estimates
of the pairwise distances. Suppose that the sequences are labeled
$s^1, \ldots, s^n$ and that position $k$ in the $j$th sequence is denoted by
$s^j_k$. For a pair of sequences $i, j$ we assume that one of them, $s^i$, has
its $k$th position picked from a distribution $q$, with probability
$q_{s^i_k}$, and that the other is related by a multiplicative substitution
process to the first sequence, i.e. the $k$th character in $s^j$ is distributed
according to $P(s^j_k \mid s^i_k, t)$, where
$\sum_b P(b \mid a, t_1) P(c \mid b, t_2) = P(c \mid a, t_1 + t_2)$ for all
characters $a$ and $c$. Furthermore, we assume that the substitution process is
reversible so that we can switch the indices $i, j$.
Then the pairwise distances are computed by

$$d_{ij} = \operatorname{argmax}_t \prod_k q_{s^i_k} P(s^j_k \mid s^i_k, t).$$

Since the terms $q_{s^i_k}$ do not depend on $t$ we can write this as

$$d_{ij} = \operatorname{argmax}_t \prod_k P(s^j_k \mid s^i_k, t).$$
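
As a concrete instance of this estimate, the sketch below maximizes the
likelihood numerically under the Jukes-Cantor substitution model. The choice of
this model is an assumption made purely for illustration (the text does not fix
a model), as are the function names, the search interval and the two example
sequences. For Jukes-Cantor the maximizer is also known in closed form, which
gives a convenient check.

```python
import math

def jc69_prob(a, b, t):
    """Jukes-Cantor probability P(b | a, t) (an assumed model, for illustration)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def log_likelihood(s_i, s_j, t):
    """log of prod_k P(s_j[k] | s_i[k], t); the q terms are dropped."""
    return sum(math.log(jc69_prob(a, b, t)) for a, b in zip(s_i, s_j))

def ml_distance(s_i, s_j, lo=1e-9, hi=10.0, iters=200):
    """argmax over t by ternary search; the log-likelihood is unimodal in t
    for this model."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(s_i, s_j, m1) < log_likelihood(s_i, s_j, m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

# Hypothetical aligned sequences differing at 2 of 10 sites.
seq_i = "ACGTACGTAC"
seq_j = "ACGTTCGTAA"
d_numeric = ml_distance(seq_i, seq_j)

# Closed-form Jukes-Cantor estimate, with p the fraction of differing sites.
p = sum(a != b for a, b in zip(seq_i, seq_j)) / len(seq_i)
d_closed = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The numerical maximizer `d_numeric` agrees with `d_closed` up to the precision
of the search. For substitution models without a closed form, only the numerical
route is available.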

When the number of sites is large, the consistency of maximum likelihood implies
that these estimates will converge to the true branch lengths in the tree (assum- 
ing that the probabilistic model is correct). Thus, distance methods based on 
ML pairwise distance estimates can be viewed as generating approximations of 
the actual ML tree. 

In practice, the use of short sequences can lead to inaccurate estimates, 
especially for longer branch lengths, and this is a common source of error for 
methods such as neighbor joining jHj. A number of solutions to this problem 
have been proposed: for example variants of neighbor joining exist that are 
less sensitive to long branch lengths T. Ranwez and Gascuel have shown that 
long pairwise branch lengths can be corrected by considering ML estimates of 
distances using three taxa j?]. Quartet methods such as quartet puzzling 
attempt to reconstruct the tree from more reliable quartets rather than pairwise 
distances. However a major problem with quartet methods has been how to 
reconcile the different topologies of the quartets into one tree. 

Our approach can be seen as generating a better approximation to the ML 
tree, by relying on more accurate m-tree weights rather than pairwise distance. 
By avoiding the need to reconcile diverse topologies, we avoid the difficulties of 
standard quartet methods. Furthermore, as m increases, the splits can be more 
accurately identified and thus a more accurate tree reconstructed. Although 
the computational complexity of subtree weight reconstruction grows rapidly 
with m, it remains polynomial, and furthermore many key steps are trivially 
parallelizable. 